Searching COVID-19 Clinical Research Using Graph Queries: Algorithm Development and Validation

Background Since the beginning of the COVID-19 pandemic, >1 million studies have been collected within the COVID-19 Open Research Dataset, a corpus of manuscripts created to accelerate research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature. Keyword-based search is the standard approach, which allows users to retrieve the documents of a corpus that contain (all or some of) the words in a target list. This type of search, however, does not provide visual support to the task and is not suited to expressing complex queries or compensating for missing specifications. Objective This study aims to consider small graphs of concepts and exploit them for expressing graph searches over existing COVID-19–related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience. Methods We considered the COVID-19 Open Research Dataset corpus and summarized its content by annotating the publications’ abstracts using terms selected from the Unified Medical Language System and the Ontology of Coronavirus Infectious Disease. Then, we built a co-occurrence network that includes all relevant concepts mentioned in the corpus, establishing connections when their mutual information is relevant. A sophisticated graph query engine was built to allow the identification of the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths. 
Results We built a large co-occurrence network, consisting of 128,249 entities and 47,198,965 relationships; the GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries; it produces a bibliography of publications, which are globally ranked; and each publication is further associated with the specific parts of the query that it explains, thereby allowing the user to understand each aspect of the matching. Conclusions Our approach supports the process of query formulation and evidence search upon a large text corpus; it can be reapplied to any scientific domain where documents corpora and curated ontologies are made available.


Introduction
Since the COVID-19 pandemic outbreak in early 2020, important clinical research efforts have been targeted at understanding the COVID-19 disease. More than 1 million studies have been collected within the COVID-19 Open Research Dataset (CORD-19), a corpus of manuscripts created to accelerate the research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature.
Searching over the literature is a nontrivial task, as it strongly relies on the quality of the data corpus, the characteristics of the search portal, and the language used to express the search.
Keyword-based search is the standard search approach, which allows users to retrieve the documents of a corpus that contain some of the words in a specified target list [1,2]. However, this type of search lacks visual support for the task and is not suitable for expressing complex research queries or compensating for missing specifications.
The development of frontend tools and visualizations for COVID-19 knowledge graphs has been motivated by several works [3,4]. Building on this motivation, we explored the use of small graph-based queries that can be built visually [5] to empower a literature exploration tool. The resulting GRAPH-SEARCH system provides both a visual language to express search queries and a friendly tool to explore relevant publications; it highlights the relationships between the original graph queries and an underlying corpus of scientific evidence, in the spirit of literature-based discovery [6].
To support this idea, the underlying textual corpus must first be analyzed and enriched; in our approach, the CORD-19 was expressed in the form of a co-occurrence network. First, we annotated all the abstracts with terms from the Unified Medical Language System (UMLS) [7] and the Coronavirus Infectious Disease Ontology (CIDO) [8]. This step closely aligns with classical work on ontology-based annotation (refer to Semantic MEDLINE [9] and our previous study on genomic metadata annotation [10,11]). Second, we built a comprehensive co-occurrence network that includes all relevant clinical and biological concepts mentioned in the corpus, linking them based on their co-occurrence in given abstracts.
The visual language used to express a query over the network describes concepts as nodes and their copresence within research abstracts as undirected edges; some concepts are associated with medical conditions, whereas others are associated with treatments or biological entities. We also allow modifiers (eg, "high" or "increased"). Queries run on the network may correspond to the expressed graph pattern or to a selected subpart.
The query semantics corresponds to extracting scientific evidence (ie, publications) from the corpus, in support of the existence of the relationships linking the expressed concepts; each search process extracts the references that best explain the relationships occurring within the query. When a specified path is not present in the co-occurrence network, alternative scored and ranked shortest paths connecting the nodes expressed in the query are proposed to the user (refer to the Methods section). The search output ranks the references by their weight, which sums up the support that they provide to the relationships in the query.
Our GRAPH-SEARCH implementation is supported by a graphical interface (refer to the Data Availability section) that allows the user to express the queries and to interpret the results in terms of concepts explained by each discovered reference, thus enabling the users to better qualify the query during the interaction; in addition, users can read the textual abstracts of the retrieved references. Such interactive exploration of the search space allows for exploring assumptions and for progressively adapting them as a result of existing evidence.
The manuscript is organized as follows: we first describe the CORD-19, the characteristics of the co-occurrence network representing CORD-19 abstracts, the technological process of building the network, the graph search operation, and the web user interface that allows users to express graph queries and explore the retrieved results. We then present a series of example use case (UC) queries relevant to COVID-19 research and review the current state of the art. We then evaluate the benefits of using our GRAPH-SEARCH as opposed to full-text indexed databases and keyword search. Finally, we draw our conclusions.

Methods
The CORD-19 Corpus
CORD-19 [12] is a corpus of academic publications about COVID-19 and related coronavirus research; it was released and maintained by the Allen Institute for AI in collaboration with The White House Office of Science and Technology Policy and other partners. Published articles and preprints were collected from several archives, including PubMed, PubMedCentral, bioRxiv, and arXiv; since its release, it has served as the basis of many COVID-19 text mining and discovery systems [12]. The final release of June 2, 2022, indexes >1 million publications. As summarized in Figure 1, approximately 79% of the documents in CORD-19 have an abstract. Of these, around 41% have a full-text JSON file available, while <11% of available full-text publications have no abstract in the metadata table. Thus, we decided to focus on data set records with an abstract. The file containing the metadata of publications in the data set is a comma-separated table (CORD-19 metadata.csv) including the following:

• A unique identifier cord_uid for a cluster of different records of the same publication: on it, we performed deduplication and subsequent reconciliation of the other metadata of the cluster into a single record.
• Title of the publication: we detected the language and filtered out those not in English.
• Abstract of the publication: only records with an actual abstract were retained.
• Publish_time: the distribution of publication times, shown in Figure 2, shows that COVID-19 publications increased in the first half of 2020. Spikes at the beginning of each year correspond to publications whose publish time is incomplete (ie, only the year field was filled). Publications before 2020 that are concerned with Middle East Respiratory Syndrome, Severe Acute Respiratory Syndrome, and the coronavirus were removed.

• Journal's abbreviated name: fuzzy matching of the abbreviated names was performed with a list of full names obtained from Scopus [13].
• Authors and DOI of the publication
• Number of citations received (ie, through numCitedBy), obtained by Semantic Scholar application programming interfaces (APIs) [14]
Records from CORD-19 are already harmonized (refer to the study by Wang et al [12]), resulting in distinct cord_uid keys. However, several records of the same publication are included, with different metadata. We deduplicated them and retained just 1 record (the one published in a peer-reviewed journal, if available, else the richest one in metadata).
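The deduplication rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the `peer_reviewed` flag and the definition of "richest" as the number of non-empty fields are hypothetical choices made for the example.

```python
from collections import defaultdict


def richness(record):
    """Count non-empty metadata fields (an assumed proxy for 'richest in metadata')."""
    return sum(1 for value in record.values() if value not in (None, "", []))


def deduplicate(records):
    """Keep one record per cord_uid: peer-reviewed first, then the richest one.

    `peer_reviewed` is a hypothetical boolean field used only in this sketch.
    """
    clusters = defaultdict(list)
    for record in records:
        clusters[record["cord_uid"]].append(record)
    return {
        uid: max(group, key=lambda r: (bool(r.get("peer_reviewed")), richness(r)))
        for uid, group in clusters.items()
    }
```

The sort key is a tuple, so the peer-review flag always dominates and metadata richness only breaks ties.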

Co-Occurrence Network
The co-occurrence network was built to support graph search; it consists of entities and relationships mined from the title and abstract fields of the metadata table. For building it, we considered 2 sources: UMLS and CIDO. UMLS [7] is a generic source that includes many vocabularies and covers the entire spectrum of medicine; CIDO [8] is an ontology specific to coronavirus infectious diseases.
As attributes, entities of the co-occurrence network include the name, an Umls_id when the entity is extracted from UMLS, and the frequency associated with the entity (ie, the number of documents in CORD-19 capturing that concept). Relationships in the co-occurrence network express the co-occurrence of 2 entities in ≥1 documents of CORD-19. Each relationship has the following attributes: a name (ie, built as concatenation in alphabetic order of the names of the entities that co-occur); a frequency (ie, the number of abstracts that mention such co-occurring entities); and several statistical indicators of the relationship's strength within the corpus, such as the pointwise mutual information value (comparing the probability of 2 concepts occurring together in the text with the probability expected if they occurred independently [15]), the normalized pointwise mutual information (NPMI) value (normalized by the Shannon self-information, ranging from −1 to 1 [16]), and the Cramer V value (measuring the strength of the association between 2 entities [17]).
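To make the three indicators concrete, the following sketch computes them from document counts. It uses the textbook definitions (base-2 logarithms, and Cramer V on the 2x2 presence/absence table, where it equals the absolute phi coefficient); the exact estimators used in the pipeline are not stated in the text, so the function name and its parameters are assumptions for illustration only.

```python
import math


def cooccurrence_measures(n_xy, n_x, n_y, n_docs):
    """Return (PMI, NPMI, Cramer V) for two concepts from document counts.

    n_xy: documents mentioning both concepts; n_x, n_y: documents mentioning
    each concept; n_docs: corpus size. Assumes 0 < n_xy < n_docs and
    0 < n_x, n_y < n_docs, so that every logarithm and ratio is defined.
    """
    p_xy, p_x, p_y = n_xy / n_docs, n_x / n_docs, n_y / n_docs
    pmi = math.log2(p_xy / (p_x * p_y))
    npmi = pmi / -math.log2(p_xy)  # normalize by the self-information of the pair
    # 2x2 contingency table: a = both, b = only x, c = only y, d = neither.
    a, b = n_xy, n_x - n_xy
    c, d = n_y - n_xy, n_docs - n_x - n_y + n_xy
    phi = (a * d - b * c) / math.sqrt(n_x * (n_docs - n_x) * n_y * (n_docs - n_y))
    return pmi, npmi, abs(phi)
```

Under these definitions, two concepts that always appear together yield an NPMI of 1, whereas two concepts that co-occur exactly as often as expected under independence yield an NPMI of 0.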
Figure 3 illustrates the process of ontology creation at a conceptual level. The process applies to textual abstracts (refer to Figure 3 where we consider an excerpt of the textual abstract of the study by Logette et al [18]) and consists of an entity recognition task aiming to extract the known ontological terms (ie, either from UMLS or from CIDO), followed by an entity linking task; eventually, we produce a co-occurrence network, whose entities are extracted terms and whose relationships connect entities that co-occur, weighted by the strength of the co-occurrence. Next, we detail the data extraction and transformation process.
Figure 3. Ontological terms are recognized in textual abstracts using entity recognition; then, this process is reiterated with approximately 660,000 publications' abstracts. Terms are connected to each other using entity linking; each relationship between entities is associated with several properties representing the co-occurrence weight, using different statistical methods. The generated connected co-occurrence network has approximately 128,000 concepts and approximately 47 million relationships. ACE2: angiotensin-converting enzyme 2; CIDO: Coronavirus Infectious Disease Ontology; NPMI: normalized pointwise mutual information; UMLS: Unified Medical Language System.

Overview
The data provision workflow is represented in Figure 4 [18]; it follows the extract-load-transform paradigm. Data were extracted from CORD-19 and loaded into the data storage system. The pipeline produces 3 data objects: the co-occurrence network; the metadata table; and the inverted index, that is, a simple postings list whose keys are the relationships of the co-occurrence network and whose elements are links to the relevant publications where such relationships co-occur. Other data tables contain intermediate results of the extraction and curation of the entities, that is, the nodes of the co-occurrence network and the computation of the co-occurrence measures used for the relationships. For storing data tables, we selected the MariaDB relational engine [19]; for storing the co-occurrence network, we selected the Neo4j graph data engine [20].

Data Loading
Three tasks apply to raw CORD-19 data and produce a metadata table. Metadata were obtained by running the "GET metadata" task on the S3 bucket of the Allen Institute for AI; then, we performed a "Wrangling and cleaning" step, followed by the "Augment and load" step, which enriches the cleaned metadata table with information from the external APIs.

Entities Mining and Linking
The "Mine entities and link" task takes as input the curated and augmented metadata table and produces the raw_entity table. With a single pass over the title and abstract, we performed typical information retrieval steps such as lexical analysis, removal of stopwords, stemming, and lemmatization. Then, we performed named entity recognition (NER), consisting of the identification and extraction of entities from unstructured text and linking to UMLS and CIDO; specifically, we used the en_core_sci_lg model of the scispaCy Python package. The selected model is particularly suited for processing English-based scientific literature, providing a vocabulary of approximately 785,000 words with 600,000 word vectors, with a declared F1-score for mentions of 68.67 (refer to the study by Neumann et al [21] for details on the achieved performances). Entities are linked to UMLS and CIDO by associating each concept with the UMLS identifier (with its related type and macroclass) and the CIDO identifier (if available).

Entity Curation
The "Entity curation" task aggregates the occurrences in the raw_entity table and outputs the entity_materialized table, collecting all the entities to be used as nodes of the co-occurrence network. In this pass, we excluded the occurrences of the entities that score a low similarity with UMLS or CIDO concepts; we used a normalized string similarity measure based on the Levenshtein distance and a threshold value of 0.7. We also included within entities some utility terms that indicate level modifiers (eg, "high" and "increased") or causative connectors (ie, "induces"). Eventually, we added the entity type and macrocategory using their names in UMLS.
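The similarity filter can be illustrated with a small, dependency-free sketch. The text specifies a normalized Levenshtein-based similarity and a threshold of 0.7, but not the normalization or the comparison operator; dividing by the length of the longer string, lowercasing both strings, and keeping values at or above the threshold are assumptions made here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(mention, concept_name):
    """Similarity in [0, 1]: 1 minus the distance divided by the longer length (assumed)."""
    longest = max(len(mention), len(concept_name))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(mention.lower(), concept_name.lower()) / longest


def keep_occurrence(mention, concept_name, threshold=0.7):
    """Retain an occurrence only if it is similar enough to the linked concept name."""
    return similarity(mention, concept_name) >= threshold
```

With this normalization, a mention that differs from the concept name only by case is retained, whereas an abbreviation compared against a long concept name falls well below the threshold.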

Link Mining
The "Link mining and inverted index creation" task uses the raw_entity table and the entity_materialized table to generate the bigram table (ie, information on the links of the co-occurrence network) and the bigram_publications table that we use as an inverted index in the information retrieval process.
A co-occurrence is a relationship between 2 concepts, and it exists when those 2 concepts occur in the same document. Each relationship is named using the convention "X.name-Y.name," where X and Y are the 2 concepts, expressed as nodes, that it connects, and X.name precedes Y.name alphabetically.
We designed a greedy algorithm, optimized for big data contexts, to extract the relationships in a single pass over the publications. This algorithm requires 2 read-only lookup tables, built before the execution: publication_entities (ie, for each publication, a list of mentioned entities) and entity_publications (ie, for each entity, a list of mentioning publications). The complexity of the algorithm is O(N²), where N is the number of entities in the entity_materialized list; in practice, the number of required comparisons is low, as the number of entities in each publication is much lower than the total number of entities selected in the "Entity Curation" step.
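The naming convention and the resulting inverted index can be illustrated as follows. This is a simplified sketch rather than the greedy algorithm itself: it uses only a publication_entities lookup (modeled as a plain dictionary) and enumerates entity pairs within each publication; the function name `mine_links` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mine_links(publication_entities):
    """Single pass over publications.

    Returns an inverted index mapping each relationship name to the set of
    publication ids in which the two entities co-occur.
    """
    inverted_index = defaultdict(set)
    for pub_id, entities in publication_entities.items():
        # Sorting guarantees that X.name precedes Y.name alphabetically in every pair.
        for x, y in combinations(sorted(set(entities)), 2):
            inverted_index[f"{x}-{y}"].add(pub_id)
    return inverted_index
```

The frequency of a relationship is then the size of its posting set, and the work per publication is quadratic only in the number of entities that the publication mentions.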

Graph Consolidation
The "Graph consolidation" task selects data from the entity_materialized and bigram tables and migrates them to the Neo4j instance to create the co-occurrence network.
The nodes are curated in the previous "Entity Curation" step. The relationships of co-occurrence are chosen at this stage, based on their NPMI, which is the pointwise mutual information normalized by the Shannon self-information (taking a value between −1 and +1); it compares the probability that the 2 entities occur together with the probability expected if they occurred independently. We exclude the relationships with NPMI≤0, as a nonpositive NPMI indicates that the 2 entities co-occur no more often than expected by chance.
The resulting co-occurrence network has 128,249 entities and 47,198,965 relationships, extracted from 662,105 initial publications. Using the Neo4j Graph Data Science library [22], we verified that the graph is a single connected component; such a condition is essential to ensure that every possible formulated graph query can be matched on the co-occurrence network.

Graph Query Search
A graph query Q is a connected graph formed by nodes and undirected relationships, where the nodes are the set of entities appearing in Q and rels(Q) is a set of arbitrary relationships connecting some pairs of entities in Q. A subgraph Q' is simply a connected subset of the nodes and relationships of Q. The search strategy is composed of 2 steps: matching of the graph query against the co-occurrence network and extracting the relevant publications.
Graph query matching is the operation of comparing the graph query Q with the co-occurrence network N created along the procedure described in the Data Provisioning and Co-Occurrence Network Construction section. By construction, each entity in Q is contained in N, whereas relationships in rels(Q), arbitrarily created in Q, may not be present in N. Both Q and N are connected graphs with undirected relationships; thus, matching Q within N can be seen as an instance of inexact graph matching [23].
Figure 5 guides the intuition of the matching operation. A graph query A-B (in blue) is searched over a co-occurrence network (in white). No direct relationship exists between A and B on the network. However, several alternative finite paths exist (ie, A-X-B, A-Y-Z-B, or A-V-Y-Z-B). Among these, A-X-B is found to be the "shortest path" between A and B, as its length or distance (in green) equals 1. All entities in Q are matched in N; then, for each relationship r in rels(Q), connecting nodes α and β, we retrieve the "shortest paths" within N that connect α and β, that is, a chain of relationships r1', r2', ..., rn', where each ri' is in rels(N), r1' starts from node α, and rn' ends in node β.
Shortest paths are computed using the allShortestPaths function available in Cypher, Neo4j v4.4 [20], which returns all the shortest paths between a pair of nodes. Candidate shortest paths are ranked by the average of the NPMI property associated with each relationship along the path; we retain the top 10 paths in the ranking.
We refer to the set of candidate shortest paths as expansion; the selection of exactly 1 preferred path among the candidates of the expansion is performed interactively by the user of the search system, as it is strictly domain or context specific.
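The construction of an expansion can be sketched in plain Python. In GRAPH-SEARCH the shortest paths are obtained from Neo4j; the breadth-first search below is only a stand-in for that step, and the data structures (an adjacency dictionary and an `npmi` dictionary keyed by unordered node pairs) are assumptions made for the example. The sketch assumes that the two endpoints are distinct nodes.

```python
from collections import deque


def all_shortest_paths(adjacency, source, target):
    """Enumerate every minimum-length path between two nodes using breadth-first search."""
    distance = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                parents[neighbor] = [node]
                queue.append(neighbor)
            elif distance[neighbor] == distance[node] + 1:
                parents[neighbor].append(node)  # another equally short route
    if target not in distance:
        return []

    def unwind(node):
        if node == source:
            return [[source]]
        return [path + [node] for parent in parents[node] for path in unwind(parent)]

    return unwind(target)


def expansion(adjacency, npmi, source, target, top_k=10):
    """Rank the shortest paths by the average NPMI of their relationships; keep top_k."""
    def average_npmi(path):
        edges = list(zip(path, path[1:]))
        return sum(npmi[frozenset(edge)] for edge in edges) / len(edges)

    paths = all_shortest_paths(adjacency, source, target)
    return sorted(paths, key=average_npmi, reverse=True)[:top_k]
```

Each returned path is a list of node names from source to target; the user would then pick exactly one of them for every expanded relationship.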
Relevant publications extraction corresponds to the retrieval of the publications that mention concepts of the matched graph, using the inverted index. We access the inverted index by relationship name, using either r when it appears in the relationships rels(N) of the co-occurrence network or all the relationships r1', r2', ..., rn' appearing in the specified path(r). The score of a publication P relative to a query Q (ie, the number of explained relationships) is computed as follows:
$$\mathrm{score}(P, Q) = \sum_{r \in \mathrm{rels}(Q)} \frac{1}{|\mathrm{path}(r)|} \sum_{r' \in \mathrm{path}(r)} \mathbb{1}\left[P \text{ mentions } r'\right]$$
Here, |path(r)| is the length of path(r), and the indicator equals 1 when P mentions r' and 0 otherwise. The addends of the external summation represent a score assigned to each relationship r in Q. Each addend captures how well P represents r; it is equal to 1 if P directly mentions the relationship of Q (eg, when path(r)=r', with length 1) or if P mentions all the relationships of path(r). Otherwise, it equals a fraction of 1, counting the number of relationships r1', r2', ..., rn' of path(r) mentioned in P, divided by the length of path(r).
Extracted publications are ordered by their score; they are further described by other properties, such as the sum of the NPMI of all the mentioned relationships and the date of publication.
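The scoring rule translates directly into code. In the sketch below, `selected_paths` maps each query relationship to the list of network relationships chosen for it (a single element when the relationship exists in the network), and `inverted_index` maps relationship names to sets of publication ids; both structures and the function name are assumptions introduced for illustration.

```python
def publication_score(publication_id, selected_paths, inverted_index):
    """Sum, over the query relationships, the fraction of each selected path
    whose relationships are mentioned by the publication."""
    score = 0.0
    for path in selected_paths.values():
        mentioned = sum(
            1 for rel in path if publication_id in inverted_index.get(rel, ())
        )
        score += mentioned / len(path)
    return score
```

For a query relationship expanded into a path of 2 relationships, a publication that mentions only one of them contributes 1/2 for that relationship, as in the running example that follows.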

Running Example
Consider Figure 6 as an example of the four steps performed during the search:
1. Create graph query (Figure 6A): Nodes are chosen among the concepts existing in the co-occurrence network; node names can be found through a dedicated browser working either by autocompletion of user-typed content (ie, matching terminologies concepts) or by selection of category and type and the contained concepts; search on multiple terminologies at the same time is allowed. For each concept, we provide a description and ID from the original source. Relationships can be drawn between any pair of nodes.
2. Find paths (Figure 6B): For each pair of entities connected by a relationship in the graph query, the Neo4j graph is queried to find the shortest paths (at most 10) with top average NPMI scores.
3. Select paths (Figure 6C): The user selects the most relevant path for each original relationship that has been expanded.
In Figure 6D, we observe that 5 expansions are produced: the first publication scores 1 in 4 expansions and 1/2 in the expansion at the top-right end of the graph query. Indeed, publication 1 only includes the relationship (AngII)-(1,0), which is half of the selected shortest path that connects (AngII) and (Vascular Permeability).
The second publication scores 0 in 1 expansion, as there is no path between (AngII) and (Vascular Permeability); 1 in 3 expansions; and 2/3 in the expansion at the left end of the graph query; the relationship (SARS-CoV-2)-(1,0) is not mentioned.

Ethical Considerations
Ethics approval was not applicable for this study.

Web Interface
With GRAPH-SEARCH, the researcher can express a query in the form of a graph query on a web interface and retrieve a list of CORD-19 publications that best correspond to the query. During the search process, each link in the original graph query is expanded and matched with the co-occurrence network. When a relationship in the query is not available in the co-occurrence network, an expansion may suggest that several sets of concepts can explain a relationship in the original graph query; therefore, up to 10 ranked paths are proposed to the user, who may express a preference according to their interest. After selecting 1 path for each expanded relationship, GRAPH-SEARCH provides a list of publications ranked by the number of explained relationships of the original graph query.
The GRAPH-SEARCH application service exposes a web user interface to query the co-occurrence network and exploit the graph-driven search methodology described in the Graph Query Search section; it contains a backend (ie, web server that exposes a Representational State Transfer Application Programming Interface for high-level retrieval operations) and a frontend (ie, visual interface that exploits the Representational State Transfer Application Programming Interfaces to use the backend).
The web interface has been designed and implemented following the major steps of the algorithm described in the Running Example subsection above. The user experience has been modeled as a multipage application; for each step of the retrieval strategy, different API services and a different page were implemented.
The frontend is built with the Vue.js framework and the D3.js library for graph illustrations, whereas the backend is written in Python and includes two components:
1. Swagger_server, which implements the web service logic, interfaces, and the models necessary to handle the persistence and asynchronicity behaviors of a multiuser system. We used the connexion framework, a Flask-based web framework, and SQLAlchemy as the database abstraction layer.
2. Core, which implements the retrieval strategy and provides high-level programming interfaces for it. This package has been designed as an independent library that can be embedded in other applications, as has been done with the backend service. Its implementation relies on several Python libraries, such as Neo4j, networkx, and SQLAlchemy.

UC1: Genetic Mechanisms of Critical Illness in COVID-19
Pairo-Castineira et al [23] revealed previously undescribed molecular mechanisms of critical illness in patients with COVID-19 with genome-wide studies. The results of such studies may provide therapeutic targets to modulate the host immune response to promote survival. Inspired by this publication, we create a graph query including relevant human genes that are related to higher or lower severity of COVID-19 (eg, IFNAR2, CCR2, and TYK2 genes), and we link them to the change in the severity of the disease (Figure 7A). Since the research idea is broad, we start the exploratory process focusing on a subgraph of the graph query (refer to the nodes in red selected in Figure 7A). Here, we only consider the effect of the increase of expression in the CCR2 gene. Figure 7B shows how GRAPH-SEARCH expands the path between the concepts "High" and "Gene Expression" (not otherwise connected in the co-occurrence network). According to NPMI values, the most relevant concept connecting them is "Up-Regulation (Physiology)." Figure 7C shows that the user has selected the path going through this concept among the proposed alternatives. The Results page (Figure 7D) shows a publication (Teixeira et al [24]) that explains 4 (80%) of the 5 relationships in the selected portion of the original graph query (all except for the one between "Gene Expression" and "High"). At this point, the user can consider other portions of the graph query or the entire query.

UC2: COVID-19 and Cystic Fibrosis
Cystic fibrosis is a disorder that affects mostly the lungs, the digestive system, and other organs in the body. It is widely known that COVID-19 also affects the respiratory system.

UC3: COVID-19 and Nonsteroidal Anti-Inflammatory Drugs
During the second year of the pandemic, interest arose in the possibility of intervening at the onset of mild to moderate COVID-19 symptoms in outpatients (instead of hospitalized patients); it was suggested that this could prevent the progression to a more severe illness and long-term complications. More specifically, Perico et al [26] investigated the use of anti-inflammatory drugs, especially nonsteroidal anti-inflammatory drugs (NSAIDs), as a therapeutic strategy. In our graph query, we include the following as main concepts: "COVID-19" (C5203670), "Outpatients" (C0029921), "Anti-Inflammatory Agents, Non Steroidal" (C0003211), and "Cyclooxygenase 2 Inhibitors" (C1257954), with the last being a specific class of NSAIDs. In this case, no expansion of the original graph query is performed, as all the relationships are present in the co-occurrence network. The Results page contains a list of 440 publications, whose abstracts discuss the concepts in the graph query from different perspectives and approaches. The top 3 results include work from Consolaro et al [27], describing a home-treatment algorithm based on anti-inflammatory drugs; Popovych et al [28], discussing the therapeutic efficacy of the BNO 1030 extract, which is a phytotherapeutic anti-inflammatory agent; and Sava et al [29], reporting the results of a 90-day treatment of patients with severe COVID-19 with a specific anti-inflammatory drug, tocilizumab.

UC4: Elevated Blood Glucose Levels and COVID-19 Severity
Elevated blood glucose levels are considered a risk factor for the severity of the disease. With GRAPH-SEARCH, we compose a Y-shaped graph query (Figure 8), expressing that high levels of blood glucose or increasing blood glucose can induce a severe illness. This example makes sophisticated use of utility terms; these are provided in a specific list of the concepts' browser of GRAPH-SEARCH. Consequently, we obtain a list of 395 results, where the top-ranked publication explains 5 out of 5 relationships. Logette et al [18] reported on the relationship between blood glucose levels and the severity of COVID-19. All following publications, ranked in descending order by the number of explained relationships of the original graph query, explain at most 3 out of 5 relationships.

UC5: COVID-19, Angiotensin-Converting Enzyme 2, and Cardiovascular Diseases
Patel et al [30] hypothesized that the infection caused by SARS-CoV-2 could be associated with the shedding of angiotensin-converting enzyme 2 (ACE2). In their study, it is suggested that in patients with cardiovascular diseases, there is increased shedding of ACE2; consequently, higher levels of ACE2 in blood circulation are associated with the downregulation of membrane-bound ACE2. The graph query in Figure 9A expresses this query by connecting patients with COVID-19 infection with cardiovascular diseases; as they have more circulating ACE2, there is a downregulation of membrane-bound ACE2. When running this query, 2 relationships are not found in the co-occurrence network; the first paths suggested by the system as possible explanations are not meaningful with regard to the context; thus, we select alternative concepts, that is, "Subacute Endocarditis" and "Intensive Care Unit" (Figure 9B). Results can be ranked by the number of citations; we found the studies by Yamaguchi et al [31] and Gupta et al [32] particularly interesting, as they propose solutions for the prevention and treatment of the side effects of COVID-19 for patients with cardiovascular diseases.

UC6: COVID-19 Vaccines and Myocarditis
The side effects of vaccines are a topic of relevance. Here, we investigate the connection between events of heart inflammation (eg, myocarditis) among adolescents and the COVID-19 Moderna vaccine. We compose a graph query in GRAPH-SEARCH with 4 nodes (Figure 10A); a triangle is formed by "Adolescent (age group)" (C0205653), "Myocarditis" (C0027059), and the "Moderna COVID-19 Vaccine" (CIDO ID obo.VO 0005157); the vaccine entity is connected to the "COVID-19" (C5203670) node. COVID-19 and Moderna COVID-19 vaccine are not directly connected; among the possible paths suggested by GRAPH-SEARCH, the 2 scoring the highest sum of mutual information are through "Vaccination" and "Myopericarditis." The latter refers to both myocarditis and pericarditis (ie, the inflammation of the pericardium, which is the sac that surrounds the heart). Selecting this concept allows us to expand the initial query to complete the match with the co-occurrence network (Figure 10B). On the Results page, 190 bibliographic resources are provided. The top-ranked one, which explains all 4 relationships of the graph query, is a report by Gargano et al [33] that highlights the implications of the use of messenger RNA vaccines with a higher risk for myocarditis in male individuals aged 12 to 29 years. The following results do not explain the relationship between the COVID-19 Moderna vaccine and COVID-19 through myopericarditis, as they explain only 3 relationships. These results, for instance, report adverse events of myocarditis after vaccination in the United States [34] and Korea [35].

Query Performances
GRAPH-SEARCH queries are composed of two computationally intensive steps: (1) the graph query matching over the co-occurrence network and (2) the retrieval and ranking of publications related to the query. For each step, we ran a performance analysis. Specifically, we simulated random queries with 2, 4, 6, 8, or 10 nodes from the existing co-occurrence network; we assume that these sizes reflect typical UC scenarios, as researchers create small queries through the graphical interface.
We separately measured the computation times of the first and second steps (shown in Figures 11A and 11B, respectively); each experiment was repeated on 10 queries, generated randomly using the "Random walk with restarts sampling" method of Neo4j. We observe that the computational times for graph matching are <2.2 seconds in all cases and grow sublinearly with the number of nodes, whereas the retrieval operation typically takes up to 3 seconds, with a small number of outliers due to cache misses; the resulting user delay in these scenarios seems quite acceptable. We also created random graph queries by removing increasing percentages of their relationships to simulate the difference between exact and inexact graph search (thereby triggering the search for alternative shortest paths); computational times (not shown for brevity) are not significantly affected.
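The idea behind random walk with restarts sampling can be sketched in plain Python. This is a standalone illustration, not the Neo4j implementation used in the experiments; the function name, the restart probability, the step limit, and the toy adjacency list are all assumptions.

```python
import random
from typing import Dict, List, Set


def rwr_sample(adj: Dict[str, List[str]], start: str, n_nodes: int,
               restart_prob: float = 0.1, seed: int = 0,
               max_steps: int = 100_000) -> Set[str]:
    """Collect up to n_nodes nodes visited by a random walk that restarts at `start`."""
    rng = random.Random(seed)
    visited = {start}
    current = start
    for _ in range(max_steps):
        if len(visited) >= n_nodes:
            break
        neighbors = adj.get(current, [])
        if not neighbors or rng.random() < restart_prob:
            # Restart: jump back to the start node (also used at dead ends).
            current = start
        else:
            # Follow one relationship; every visited node is reachable from start.
            current = rng.choice(neighbors)
            visited.add(current)
    return visited


# Toy complete graph on 6 hypothetical concepts.
nodes = ["a", "b", "c", "d", "e", "f"]
adj = {n: [m for m in nodes if m != n] for n in nodes}
sample = rwr_sample(adj, start="a", n_nodes=4)
```

Because each sampled node is reached by following relationships from the start node, the sampled nodes induce a connected subgraph, which is why this kind of sampling is suited to generating small connected queries.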

Related Work
In this section, we review classic approaches to search over co-occurrence networks. Then, we focus on the specific use of bio-ontologies in information extraction systems, and finally, we propose a close comparison with COVID-19-specific search systems.

Semantic-Network Search
The task of searching and extracting literature documents over co-occurrence networks with graph-based queries can be considered through the subproblems that compose it. To query a co-occurrence network with a graph-like query, a similarity measure between graphs must be defined. Existing methods in the context of graph databases include definitions of graph edit distances and maximum common subgraphs [36], but a later approach introduced a similarity measure based on a graph kernel between pairs of documents, which exploits the shortest paths between nodes as units to compare graphs [37]. Regarding the construction of co-occurrence networks from data sets of literature documents, different approaches are available to extract the concepts that represent nodes in the network and the connections between them. The survey by Han et al [38] and the study by Shi et al [39] present the main methodologies and text mining pipeline architectures, which those works apply to engineering and design (ie, subsets of scientific literature). G-Bean [40] is also relevant; it is a graph-based tool that exploits ontologies for graph-based query expansion to support the discovery of the user's search intention.

Literature Annotation and Bio-Ontologies
The incorporation of bio-ontologies in information extraction and information retrieval has demonstrated its efficacy through diverse applications, such as patent information retrieval [41] and identification of concept domains [42]. Bio-ontologies are also applied in natural language processing tasks, such as NER [43]. Moreover, Wang et al [44] illustrated the application of bio-ontologies in retrieving biomedical data sets, while Maraver et al [45] emphasized their role in literature search facilitation and metadata organization. The potential for refining search queries through ontology-guided expansion is also a recurring theme in the biomedical literature for information retrieval. Diaz-Galiano et al [46] and Dong et al [47] show query expansion methodologies using different medical vocabularies.
A fundamental aspect of research in this domain pertains to the availability and use of suitable corpora and data sets; previous studies [48,49] have provided foundational annotated and curated resources that underpin the experimental frameworks addressing these tasks. Lately, the integration of bio-ontologies with language models has also gained traction within the context of bioinformation extraction [50,51].
The CORD-19 data set received the widest attention. Several knowledge graphs that exploit this data set were proposed at the beginning of the pandemic for representing biomedical entities (eg, CORD-NER [55] and COVID-19 KG [56]) or publication metadata (eg, COVID-19-Literature [57]). More recently, CovidPubGraph [58] has provided a comprehensive and updated knowledge graph, which integrates information from multiple sources, making results available through a SPARQL end point. Finally, CovidGraph [59] exposed a knowledge graph in the Neo4j browser; several external ontologies are used to tag entities. The focus of these resources is more on organization and semantic enrichment than on exploration.
The goal of the TREC-COVID initiative [60] was to establish targeted retrieval tasks in response to the pandemic, to be shared and collectively addressed by the community. Instead, GRAPH-SEARCH aims to make the literature about COVID-19 searchable and explorable. This objective is shared by 2 other systems, LitCovid and Outbreak.info; these support enhanced keyword-based search, but they do not offer any graph-based search support.
LitCovid [1] was developed within the US National Institutes of Health as a comprehensive resource of literature on COVID-19 (372,221 publications at the time of writing), updated regularly starting from PubMed. Publications are manually screened to assess their relevance to COVID-19. They are then categorized (eg, overview, disease mechanism, transmission dynamics, treatment, case report, and epidemic forecasting); assigned geographical locations; and annotated with any drug or chemical-related information found in their title and abstract, if applicable. The updated version [61] introduced the long-covid category, added annotations on variants and vaccines, and used machine learning algorithms to support topic categorization (with an updated model) and entity recognition (with NER). The interface allows users to apply filters on country, journal, drug, variant, and vaccine and to compose search strings combining AND, OR, and NOT operators (which are not documented); results are ranked by relevance, based on the widely used BM25 ranking function of Lucene. LitCovid compares its performance favorably with the classical keyword search of PubMed (where annotations or tags are not used).
Outbreak.info Research Library [2] is a project of the Hughes, Su, Wu, and Andersen laboratories at Scripps Research. It offers a searchable interface of COVID-19 publications (complementing the content of LitCovid by integrating preprint servers), together with clinical trials, data sets, protocols, and other resources. The data structure upon which the search is performed is supported by a schema; entities are connected by links with various semantics. The visual interface allows the use of some filters and keyword search; results are ranked by relevance based on the Lucene Practical Scoring Function on Elasticsearch (prioritizing the query normalization factor, coordination factor, term frequency, and inverse document frequency).

Discussion
In this section, we discuss how the proposed graph query search could be compared to other information extraction setups. For this purpose, we focus on 2 UC queries, that is, the linear query presented in UC3 (4 nodes in a linear pattern) and the red subgraph shown in UC1 (a nonlinear 6-node query, expanded with an additional node in GRAPH-SEARCH).

Comparison With COVID-19 Literature Search Systems
First, we considered running the UCs on the COVID-19 literature-dedicated search systems LitCovid and Outbreak.info. Both systems were queried using concept names corresponding to UMLS terms in the nodes; unfortunately, they both suffer from the limitations of Boolean search. Specifically, if we search with conjunctive clauses and exact search (eg, using "Outpatients" AND "Anti-Inflammatory Agents, Non Steroidal" AND "Cyclooxygenase 2 Inhibitors" AND "COVID-19" for UC3), neither system returns any result. Dealing with exact search is hard. For instance, with LitCovid, the query "Cyclooxygenase Inhibitors" produces 3 results, whereas the query "Cyclooxygenase 2 Inhibitors" produces 5 results, although the latter is apparently more restrictive; instead, the query Cyclooxygenase Inhibitors (no quotes), without exact search, produces 12,287 results (including all references referring to generic inhibitors). Table 1 reports the results of LitCovid with conjunctive queries but no exact matching, while a similar search is not supported by Outbreak.info. In comparison, GRAPH-SEARCH reports 327 results for UC1 and 440 results for UC3. These outputs are hardly comparable, mainly because with LitCovid it is not possible to build a unique graph-shaped query; therefore, results of single conjunctive queries need to be evaluated one after the other, whereas GRAPH-SEARCH aggregates the results of several conjunctive chains; it also expands given concepts with their acronyms (eg, "anti-inflammatory agents, non steroidal" is also searched as "NSAIDs"). In addition, GRAPH-SEARCH allows for the expansion of specific links by adding new concepts (eg, "Up-Regulation [Physiology]" in UC1). No domain-specific system for COVID-19 supports graph-based search, which would have allowed a more insightful comparison.

Comparison With the Search on Full-Text Indexed Corpora
We also attempted a comparison with search operations performed on a baseline created by full-text indexing the CORD-19 titles and abstracts. Specifically, we used the full-text indexing option of MariaDB, an open-source fork of MySQL [19]. Typically, full-text indexes work well for regular text; they build an index over specific words rather than the whole text, and consequently, they show good performance for searches of specific words. The same queries used on LitCovid and Outbreak.info were used on this setup: on MariaDB, we used the "Natural language mode" documented by MariaDB [62] and, thus, we removed the "AND" Boolean operators and parentheses. To be part of the index, words must appear in <50% of the documents to be considered potentially relevant and to be used in searches (consequently, "COVID-19" and "SARS-CoV-2" are not considered relevant). Results are returned in descending order of relevance; limitations include the exclusion of partial (or very short or long) words.
Notwithstanding our attempts, we note that the comparison of the GRAPH-SEARCH approach with the full-text indexing setup is very difficult for many reasons:
1. The databases upon which search is performed are built on different assumptions (eg, to be part of the index, words must appear in <50% of the documents; the co-occurrence network only includes entities that score high similarity with ontology concepts and excludes relationships with a negative NPMI).
2. In one case, we perform separate keyword search sessions with separate results (with associated precision and recall measures); in the other case, we retrieve aggregated results (with summarized measures).
3. On one side, the ranking produced is only on single query result sets; on the other side, it is a global ranking.
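The exclusion of relationships with a negative NPMI, mentioned in the first point, can be made concrete with a small sketch. It uses the standard definition of NPMI (pointwise mutual information divided by the negative logarithm of the joint probability) and assumes that probabilities are estimated from abstract-level counts; this estimation choice, the function names, and the handling of boundary cases are assumptions and may differ from the actual pipeline.

```python
import math


def npmi(n_xy: int, n_x: int, n_y: int, n_docs: int) -> float:
    """Normalized pointwise mutual information from document counts.

    n_xy: documents mentioning both concepts; n_x, n_y: documents
    mentioning each concept; n_docs: total number of documents.
    """
    if n_xy == 0:
        return -1.0  # the concepts never co-occur: lower bound of NPMI
    p_xy = n_xy / n_docs
    if p_xy == 1.0:
        return 0.0  # undefined (0/0) case; returning 0.0 is a convention assumed here
    p_x = n_x / n_docs
    p_y = n_y / n_docs
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


def keep_relationship(n_xy: int, n_x: int, n_y: int, n_docs: int) -> bool:
    """Keep a relationship only if its NPMI is not negative."""
    return npmi(n_xy, n_x, n_y, n_docs) >= 0.0
```

For example, 2 concepts that each appear in 10 of 100 abstracts and always appear together have an NPMI of 1, whereas 2 concepts that each appear in 50 of 100 abstracts but co-occur in only 1 have a negative NPMI, and their relationship would be dropped.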
The results are reported in Table 1; they must be read considering all these aspects. Note that refining the results achieved with keyword search is restricted to manipulating Boolean expressions, adding keywords, and dropping keywords. By contrast, the results on GRAPH-SEARCH (327 and 440 for UC1 and UC3, respectively) are inspectable, with ranking, ordering, filtering, and visualization options dedicated to the explained chains of entities; using our search paradigm, users can compose graph queries; more complex topologies also allow a stronger explainability of results.

Conclusions
GRAPH-SEARCH is the first search engine to propose the exploration of COVID-19 scientific literature using visual graph queries. GRAPH-SEARCH provides several unique features such as the possibility to describe concepts using well-known ontologies, to establish co-occurrence relationships between any 2 concepts of choice, to support search queries with concepts proposed and ranked by the system, and to browse resulting publications exploiting several visual and analytic measures.
The completeness and accuracy of the information captured in the co-occurrence network strictly depend on the advances of the NER methods used during the steps of entity mining and linking. Other systems have used expert curation (eg, LitCovid) or community-driven curation (eg, Outbreak.info). Although expert curation can improve the search experience, it does not properly scale; we opted for the exploitation of well-known biomedical ontologies such as UMLS and CIDO and state-of-the-art natural language processing models used for Entity Recognition in our data provision pipeline.
The ability of our system to extract results was evaluated by attempting a comparison with existing published systems (eg, LitCovid and Outbreak.info) and with full-text indexing search. We recognize that comparisons between the results retrieved from these systems are not ideal, as it is problematic to compare single search runs with a system where the result is built progressively on the graph, considering a set of aspects altogether (ie, how the network was built and pruned, shortest path computation, completion with additional nodes, and global ranking of results).
Co-occurrence networks are conventionally used for analyzing extensive text and big data. Common applications have involved sentiment analysis [63] and detection of prevailing topics [64]. In these applications, each node is a word occurring in a set of user-generated social media posts. Moreover, word co-occurrence networks are present in clinical applications; for example, Millington and Luz [65] proposed to encode recordings of speech data used for recognizing patients with Alzheimer disease and controls. In all such cases, GRAPH-SEARCH may be used to find specific subgraphs and propose completions of missing links.
In this study, we have demonstrated the capability of domain-specific (even inexact) graph query matching when semantics is considered only for nodes; we are aware of the limitations of this approach, which, at this stage, is considered a modeling choice. In future work, we plan to extend our search system to semantically rich knowledge graphs with both entities and relationships, thereby enriching the expressivity of graph queries (also including the possibility to capture the semantics of relationships, with state-of-the-art methods [66] or as we already experimented in a previous study [67]). Then, we aim to formalize the use of graph queries in the context of graph databases by studying the complexity of graph search and connecting it to classical theories of subgraph matching, shortest path search, and conjunctive query processing.
We also aim to conduct extensive empirical studies to measure user satisfaction with systems such as GRAPH-SEARCH, analyzed along the 3 dimensions of usability, usefulness in deepening users' knowledge of certain connected topics, and support of users' intentions in knowledge exploration.

Figure 1. Euler-Venn diagram of the overlap of publications with abstract and publications with full-text JSON from PDF or from PubMedCentral (PMC) in the COVID-19 Open Research Dataset (CORD-19).

Figure 2. Line plot showing the base-10 logarithm of the number of publications (y-axis) by publication date (x-axis).

Figure 3. Rationale of co-occurrence network construction. Ontological terms are recognized in textual abstracts using entity recognition; then, this process is reiterated with approximately 660,000 publications' abstracts. Terms are connected to each other using entity linking; each relationship between entities is associated with several properties representing the co-occurrence weight, using different statistical methods. The generated connected co-occurrence network has approximately 128,000 concepts and approximately 47 million relationships. ACE2: angiotensin-converting enzyme 2; CIDO: Coronavirus Infectious Disease Ontology; NPMI: normalized pointwise mutual information; UMLS: Unified Medical Language System.

Figure 4. Workflow diagram of the GRAPH-SEARCH data provision pipeline. Tasks are performed sequentially; each task uses data objects and produces data objects, starting from the raw COVID-19 Open Research Dataset (CORD-19) metadata.csv file present in CORD-19, which is translated into metadata.csv once cleaned. The final outcome of the pipeline is a Neo4j database containing the network.

4. Retrieve publications and return ranking to the user (Figure 6D): The system collects the names of all the relationships from the expanded graph query (computed in step B and selected in step C) and exploits them to retrieve the posting lists of publications (from the inverted index). It computes the relationships explained by each publication. Then, it ranks the publications by (1) the number of explained relationships of the original graph query (refer to equation 1), (2) the sum of NPMI scores of the relationships, and (3) the publication date. Finally, it shows the complete list with the publications' metadata.
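The retrieval and ranking step can be sketched as follows. This is a simplified illustration under stated assumptions: posting lists are modeled as a dictionary from relationships to sets of publication identifiers, relationships of the expanded query are counted directly (equation 1 is not reproduced here), and newer publications are placed first when the other criteria tie. All names and values are hypothetical.

```python
import math
from datetime import date
from typing import Dict, List, Set, Tuple

Rel = Tuple[str, str]


def rank_publications(posting_lists: Dict[Rel, Set[str]],
                      npmi: Dict[Rel, float],
                      pub_dates: Dict[str, date]) -> List[Tuple[str, int, float]]:
    """Rank publications by explained relationships, NPMI sum, and date."""
    # Invert the posting lists: which relationships does each publication explain?
    explained: Dict[str, Set[Rel]] = {}
    for rel, pubs in posting_lists.items():
        for pub in pubs:
            explained.setdefault(pub, set()).add(rel)

    def npmi_sum(pub: str) -> float:
        # fsum gives an exactly rounded sum, so ties do not depend on set order.
        return math.fsum(npmi[rel] for rel in explained[pub])

    def sort_key(pub: str):
        return (len(explained[pub]), npmi_sum(pub), pub_dates[pub])

    # reverse=True: more relationships, higher NPMI sum, newer date first (assumed).
    ordered = sorted(explained, key=sort_key, reverse=True)
    return [(pub, len(explained[pub]), npmi_sum(pub)) for pub in ordered]


# Hypothetical relationships, weights, and publications.
r1, r2 = ("A", "B"), ("B", "C")
posting_lists = {r1: {"p1", "p2", "p3"}, r2: {"p1", "p3"}}
npmi = {r1: 0.4, r2: 0.3}
pub_dates = {"p1": date(2021, 1, 1), "p2": date(2022, 1, 1), "p3": date(2021, 6, 1)}
ranking = rank_publications(posting_lists, npmi, pub_dates)
```

In this toy example, p3 and p1 explain both relationships and tie on the NPMI sum, so the more recent p3 comes first; p2 explains a single relationship and is ranked last despite being the newest.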

Figure 6. (A) Example of a graph query with 6 concepts and 5 relationships. (B) Match of graph query on the co-occurrence network, with the search of shortest paths (in the dashed spaces called expansions). Considering the relationship between SARS-CoV-2 and angiotensin-converting enzyme 2, its expansion includes 3 paths of length 3, each characterized by 2 intermediate nodes. Light green paths have the highest average normalized pointwise mutual information (NPMI) of each expansion. (C) Regardless of the suggested paths with the highest average NPMI, users can select any path (dark green). (D) A list of publications, ranked by their score, is extracted; the score is computed using equation 1 and considers all the relationships in the selected paths that are mentioned in the publication. ACE2: angiotensin-converting enzyme 2; AngII: angiotensin II.
UC1 emphasizes the strength of exploratory search over graphs by supporting users in selecting graph portions, optionally accepting proposed expansions, and browsing results in terms of NPMI and explained relationships. UCs of increasing complexity are provided next, offering examples of searches upon graph queries with different shapes: UC2 and UC3 introduce very simple linear graph queries (ie, 1 chain of nodes), UC4 shows the use of a Y-shaped graph query, and UC5 and UC6 represent more complex shapes with nodes forming triangles.

Figure 7. GRAPH-SEARCH screens dedicated to use case 1 (UC1): (A) graph query, (B) find paths, (C) select paths, and (D) first publication on the results page.

Figure 8. Graph query of use case 4 (UC4), with Unified Medical Language System concepts IDs in red.

Figure 11. Box plots measuring the time for (A) the graph matching operation and (B) the publication retrieval operation performed using complete graph queries of 2, 4, 6, 8, and 10 nodes (each repeated 10 times).
is a community-driven open-source biomedical ontology in the area of COVID-19.
How has their connection been investigated in CORD-19? The simplest possible graph query in GRAPH-SEARCH holds 2 nodes (ie, cystic fibrosis and COVID-19) connected by 1 relationship of co-occurrence. "Cystic fibrosis" is represented by UMLS concept ID C0010674, and "COVID-19" is represented by the UMLS concept ID C5203670. The 2 concepts are not directly connected within the network; among the proposed paths in the expansion, we choose the one through the concept "Respiratory secretion viscosity alteration" (UMLS ID 3537094). Only 1 publication in CORD-19 explains this path, covering it completely, with an NPMI sum of 0.5668. Kratochvil et al [25] characterized the composition of respiratory secretions of intubated patients with COVID-19 infection, finding that they closely resemble those of cystic fibrosis, a minor observation unrelated to clinical severity. In general, the lack of relevant clinical references confirmed our expectation that cystic fibrosis did not impact COVID-19 severity.

Table 1. Results of the evaluation of use case (UC) 1 (Figure 7) and UC3 queries when performed on the LitCovid search interface, on the full-text indexed MariaDB database, and on GRAPH-SEARCH.
©Francesco Invernici, Anna Bernasconi, Stefano Ceri. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 30.05.2024. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.